23 Oct 2022

Today: Regression (part 2):

  • Review of the simple regression model
    • Step 1. What does the linear relationship look like (regression coefficients)?
    • Step 2. How strong is the linear relationship (\(R^2\))?
    • Step 3. Statistical inference about coefficients (t-test and F-test)?
  • Standardized regression coefficients: \(\beta\)
  • Intro multiple regression analysis

Reference material:

  • Chapter 11: read 11.4-11.8, 11.11
  • Book 2(!): Chapter 4: read 4.1-4.3

Ptolemy

Thought experiment

If I tell you the average grade for this course from last year was…

\(\bar{Y} = 6.1\)

Then what grade would you reasonably expect to get?

Thought experiment

If I additionally tell you that there is a “strong effect” of hours studied on grade, and you know that you have studied more than average, does this change your expectation?

What does this show?

  1. The average value is the best prediction in absence of other information
  2. If you possess other variables that correlate with the outcome, you can use that information to make better predictions

Scatterplot

If there were no association, the mean \(\bar{Y}\) would be the best prediction:

Association between hours studied and grade

The predictions are a bit wrong for everyone:

Association between hours studied and grade

Fun fact: The square root of the average of these squared errors is (essentially) the standard deviation of “Grade”

Association between hours studied and grade

But: The points seem to follow a diagonal line upwards, instead of the horizontal line of the mean:

Linear effect

The distances from a diagonal line are clearly smaller than from the horizontal line of the mean:

Linear effect

You can follow the line to see what grade to expect for your hours studied. These predictions are clearly better than the mean:

Formula

A diagonal line is described as:

\(Y = a + bX\)

Coefficients

The formula for a line is

\(Y = a + bX\)

\(a\) is the intercept, where the line intersects the Y-axis.

  • Predicted value for X is 0

\(b\) is the slope, how steep the line is

  • The average increase in Y, for 1 step increase in X

Prediction error

The prediction \(\hat{Y}_i\) is rarely exactly identical to the grade of an individual student, \(Y_i\)

There is always prediction error, \(Y_i - \hat{Y}_i\):

Estimating the coefficients

Plugging the estimated coefficients into the formula:

\(\hat{Y}_i = 2.9 + 0.8*X_i\)

Student 71 studied 4 hours, so the predicted grade is (\(\hat{Y}_{71}\)):

\(\hat{Y}_{71} = 2.9 + 0.8 * 4 = 6.1\)

In reality, student 71 scored an 8.8, so the prediction error was \(Y_{71} - \hat{Y}_{71} = 8.8 - 6.1 = 2.7\)
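A minimal sketch of this calculation in Python (only the numbers shown on this slide are used):

```python
# Reproducing the prediction for student 71 from the estimated
# coefficients a = 2.9 (intercept) and b = 0.8 (slope).
a, b = 2.9, 0.8
hours_71 = 4      # hours studied by student 71
grade_71 = 8.8    # observed grade

y_hat = a + b * hours_71     # predicted grade
error = grade_71 - y_hat     # prediction error Y_71 - Y_hat_71

print(round(y_hat, 1))   # 6.1
print(round(error, 1))   # 2.7
```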

Complete regression formula

The formula \(Y = a + bX\) describes the AVERAGE line through the data.

We can include the prediction error in the formula:

\(Y_i = a + b*X_i +\epsilon_{i}\)

We usually assume that \(\epsilon_{i} \sim N(0, \sigma_{e})\)

Complete regression formula

\(Y_i = a + b*X_i +\epsilon_{i}\)

  • \(Y_i\): the value of dependent variable Y for individual i
  • \(a\): intercept of the regression line
  • \(b\): slope of the regression line
  • \(X_i\): value of predictor variable X for individual i
  • \(\epsilon_i\): prediction error for individual i

Observed and predicted values

“The individual values of Y are equal to the intercept, plus the slope times the individual value on X, plus the individual prediction error”

\(Y_i = a + b*X_i +\epsilon_{i}\)

And also:

“The individual values of Y are equal to the predicted value, plus individual prediction error”

\(Y_i = \hat{Y}_i + \epsilon_{i}\)

The predicted value is the value on the regression line: \(\hat{Y}_i = a + b*X_i\)

Sums of squares

How much error is there in total?

We want to know how good/bad the model is for all participants.

Can we just add the prediction errors for all 92 participants?

Sum of Squared Errors

Because the line is exactly in the middle of the data, the average of all prediction errors is always 0

Sum of positive prediction errors: 36.25

Sum of prediction errors: -36.25
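This zero-sum property can be checked with a short sketch (the data here are made up; the OLS slope and intercept are computed with the usual closed-form formulas):

```python
# For any OLS line, the prediction errors always sum to (numerically)
# zero, so simply adding them tells us nothing about model fit.
xs = [1, 2, 3, 4, 5, 6]                 # hypothetical hours studied
ys = [3.0, 4.5, 4.0, 6.0, 6.5, 8.0]    # hypothetical grades
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Closed-form OLS estimates: b = S_xy / S_xx, a = ybar - b * xbar
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

errors = [y - (a + b * x) for x, y in zip(xs, ys)]
print(abs(sum(errors)) < 1e-9)  # True: positive and negative errors cancel
```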

Sum of Squared Errors (SSE)

By squaring the prediction errors, we always get a positive sum which reflects the total error across all participants.

Sum of squared prediction errors (sum of squared errors, SSE):

\[ \sum{(Y_i - \hat{Y}_i)^2} = 84.18 \]

This is called a sum of squares, and its formula always looks like:

\[ \sum(\dots-\dots)^2 \]

Smallest possible SSE

Linear regression by definition gives the line with the smallest possible total prediction error

The method is called “ordinary least squares” (OLS): it chooses the line with the least (smallest) sum of squared errors.

How good is our regression line?

To decide how good our model is, we compare the SSE against the sum of squares you would get if there were NO association between the predictor and outcome

Remember: What value would you predict for everyone if there were no association between Hours and Grade?

Total Sum of Squares

You would predict the mean Grade for everyone, \(\bar{Y}\).

The sum of squared distances of individual scores to the mean is called the Total Sum of Squares, TSS: \(\sum{(Y_i-\bar{Y})^2} = 255.82\)

Regression Sum of Squares

The improvement in predictions made by the regression line, compared to the mean, is called Regression Sum of Squares, RSS:

The difference between the regression line and the mean:

\[ \sum{(\hat{Y}_i-\bar{Y})^2} \]

This is the same as: Total SS - Error SS = Regression SS

255.82 - 84.18 = 171.64

Demo sum of squares

Sum of squares formulas

  • SSE: \(\sum{(Y_i - \hat{Y}_i)^2}\), same as SST - SSR
  • SST: \(\sum{(Y_i - \bar{Y})^2}\), same as SSR + SSE
  • SSR: \(\sum{(\hat{Y}_i - \bar{Y})^2}\), same as SST - SSE
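A hedged sketch with made-up data: computing SSE, SST, and SSR directly and checking the identity SST = SSR + SSE.

```python
xs = [1, 2, 3, 4, 5, 6]                 # hypothetical hours studied
ys = [3.0, 4.5, 4.0, 6.0, 6.5, 8.0]    # hypothetical grades
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Ordinary least squares estimates for slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
yhats = [a + b * x for x in xs]         # predicted values on the line

sse = sum((y - yhat) ** 2 for y, yhat in zip(ys, yhats))   # error SS
sst = sum((y - ybar) ** 2 for y in ys)                     # total SS
ssr = sum((yhat - ybar) ** 2 for yhat in yhats)            # regression SS

print(abs(sst - (ssr + sse)) < 1e-9)  # True: the decomposition holds
```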

Explained variance

Explained variance

Sums of squares are not interpretable on any meaningful scale, and cannot be compared across datasets

Solution: We standardize the sums of squares.

Explained variance

Which part of the TOTAL sum of squares (TSS) is explained away by the regression line (RSS)?

\(\frac{RSS}{TSS} = \frac{TSS-SSE}{TSS} = R^2\)

\(R^2\) is called the proportion of explained variance

For two variables, that is the same as the correlation (\(r\)) squared; hence R-squared

Explained variance

Which portion of the total variance in the outcome is explained by the values on the predictor?

(See demo again)

In this case: \(\frac{171.64}{255.82} = 0.67\)
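A quick check of this arithmetic (values taken from the sums of squares reported above):

```python
# R^2 is the regression sum of squares divided by the total sum of squares.
rss = 171.64   # regression sum of squares (improvement over the mean)
tss = 255.82   # total sum of squares (distances to the mean)

r_squared = rss / tss
print(round(r_squared, 2))  # 0.67
```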

Tests

Is the regression significant?

Does the regression line explain significantly more variance than the mean line?

Formulate your hypotheses:

  • \(H_0\): \(R^2 = 0\)
  • \(H_A\): \(R^2 > 0\)

Important:

\(R^2\) can only have positive values, so we need a test statistic that can only take positive values

F-Test

We use the F-distribution

F-Test

The F-test is a ratio of two sources of variance:

\[ F = \frac{\sigma^2_{\text{regression}}}{\sigma^2_{\text{error}}} = \frac{SSR/(p-1)}{SSE/(n-p)} \]

\(p\) is the number of parameters (coefficients) in the regression equation (intercept and slope), and \(n\) is the number of participants.

  • \(df_1\): p-1
  • \(df_2\): n-p
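A hedged check of this ratio using the sums of squares reported earlier (SSR = 171.64, SSE = 84.18), with n = 92 students and p = 2 parameters:

```python
# F = (SSR / df1) / (SSE / df2), with df1 = p - 1 and df2 = n - p.
ssr, sse = 171.64, 84.18
n, p = 92, 2
df1, df2 = p - 1, n - p   # 1 and 90

f = (ssr / df1) / (sse / df2)
print(round(f, 2))  # 183.51
```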

Reporting

The regression model explained a significant proportion of the variance in the outcome, \(R^2 = 0.67, F(1, 90) = 183.51, p < .001.\) This means that the number of hours studied explained 67% of the variance in exam grades.

Testing coefficients

Similarly, you can test whether the coefficients (a and b) are significantly different from 0.

Can the intercept (a) and slope (b) take positive/negative values?

Testing coefficients

Because the intercept (a) and slope (b) can take positive/negative values, we use the t-distribution:

  • \(H_0\): \(b = 0\)
  • \(H_A\): \(b \neq 0\)

\[ t = \frac{b-b_{H0}}{SE_b} \]

  • \(df\): n - p - 1

(here \(p\) is the number of predictors; the extra 1 accounts for the intercept)
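A sketch of the t calculation. The slides do not state \(SE_b\), so the standard error below is a hypothetical value chosen purely for illustration:

```python
# t = (b - b_H0) / SE_b, testing the slope against zero.
b = 0.78        # estimated slope
b_h0 = 0.0      # value under the null hypothesis
se_b = 0.0576   # hypothetical standard error (SPSS reports SE_b directly)

t = (b - b_h0) / se_b
print(round(t, 1))  # roughly 13.5
```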

Test

Reporting

The effect of hours studied on exam grade was significantly different from zero, \(b = 0.78, t(90) = 13.55, p < .001.\) This means that for every additional hour studied, the average grade increased by 0.78.

In SPSS

Assumptions

Assumption: Normality

The prediction errors are normally distributed with a mean of 0. Is this normal?

\(\epsilon_i \sim N(0, \sigma)\)

Assumption: Homoscedasticity

The prediction errors are distributed identically for all values of the predictor!

Assumption: Linearity

Multiple regression

Multiple regression

  • When to use?
  • Multiple regression model with two predictors
  • Main questions
    • What does the model look like?
    • Explanatory power of both predictors together?
    • Which predictor is most important?

How to explain differences in…

Income?

Multiple Regression: unique effects

Aim: predict dependent variable Y from multiple predictors \(X_1, X_2, \ldots,X_k\) with a linear model:

\(y_i = b_0 + b_1 * x_{1i} + b_2 * x_{2i} + \ldots + b_k * x_{ki} + \epsilon_i\)

This will give you the unique/partial effect of each predictor, while keeping all other variables constant
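A hedged sketch of fitting such a model with two predictors, using made-up data and numpy’s least-squares solver (the variable names and numbers are illustrative, not from the slides):

```python
# OLS with two predictors: y_i = b0 + b1*x1_i + b2*x2_i + e_i.
import numpy as np

x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # e.g. hours studied
x2 = np.array([1.0, 1.0, 0.0, 2.0, 1.0, 3.0])     # e.g. prior courses taken
y = np.array([4.1, 5.0, 5.2, 7.9, 7.0, 10.1])     # e.g. grade

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coefs                                 # intercept, two slopes

resid = y - X @ coefs
print(abs(resid.sum()) < 1e-9)  # True: OLS residuals still sum to ~0
```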

When to use?

  • To make better predictions using all available predictors
  • To compare relative importance of different predictors
  • To improve causal inference

Standardized regression coefficients

Standardizing regression coefficients

Problem: We want to know how important different predictors are

Problem: We want to compare the effect of the same variable across two studies

Solution: Standardize the regression coefficients to make them roughly comparable (but there are limitations)

What is a standardized regression coefficient?

It’s just the regression coefficient you would get IF you carried out the analysis after standardizing the X and Y variables

Instead of X and Y, we use Z-scores:

\(Z_x = (X - \bar{X}) / SD_x\)

\(Z_y = (Y - \bar{Y}) / SD_y\)

Z-scores: mean = 0, SD = 1

Z-scores lose the original units of a variable. The new unit is the SD: a Z-score of 1.3 means “1.3 standard deviations above the mean”
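A sketch with made-up data: running the regression on Z-scores gives the standardized coefficient, which for a single predictor equals \(b \cdot SD_x / SD_y\) (and the correlation \(r\)):

```python
from statistics import mean, pstdev

xs = [1, 2, 3, 4, 5, 6]                 # hypothetical predictor values
ys = [3.0, 4.5, 4.0, 6.0, 6.5, 8.0]    # hypothetical outcome values

def zscores(v):
    # Convert a variable to Z-scores: mean 0, SD 1.
    m, s = mean(v), pstdev(v)
    return [(x - m) / s for x in v]

def slope(x, y):
    # Closed-form OLS slope: S_xy / S_xx.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)

b = slope(xs, ys)                       # unstandardized coefficient
beta = slope(zscores(xs), zscores(ys))  # standardized coefficient

print(abs(beta - b * pstdev(xs) / pstdev(ys)) < 1e-9)  # True
```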

Interpretation

Unstandardized

A one-point increase in X is associated with a \(b\)-point increase in Y

Standardized

A one-SD increase in X is associated with a \(\beta\) SD increase in Y

When to use (un)standardized coefficients?

Unstandardized

  • If the units are meaningful/important (e.g., years, euros, centimeters, number of questions correct)
  • If there are (clinical) cut-off scores

Standardized

  • When units are not meaningful (e.g., depression, need to belong, job satisfaction, Likert scales).
  • If you want to compare effect sizes / variable importance

In this course, we will mostly focus on the unstandardized regression coefficient b

Causality

Causality

  • Often, we want to find causal relationships: X -> Y
    • Treatment, Policy decisions, Investments
  • Causality can only be established using experiments, or assumed based on theory
  • If our theory implies alternative explanations, we can account for these using multiple regression
    • Our theory could be wrong. In this case, our analysis can give misleading results

Types of multivariate relationships

  • Spurious association between X and Y (there is common “cause” to both)
  • Mediation effect X -> M -> Y (chain relationship)
  • Multiple causes (e.g., job performance = motivation + education)
  • Interaction: (the effect of X on Y depends on the level of a third variable, M)

Types of multivariate relationships

Types, continued

Confounders

  • A confounding variable causes BOTH X and Y
  • This distorts (often inflates) the observed relationship between X and Y
  • Assuming that your model is correct, controlling for confounders improves causal inference
  • Common suspected confounders are gender, age, education
    • Controlling for a variable that is not causally related to the outcome can bias your results, so don’t put EVERYTHING in the model without good reason

Confounder example

Confounder example 2

More next week…

… then we will carry out multiple regression analysis with two predictors!

Reference material for next week:

  • Book 2(!): Chapter 4: 4.6 and 4.7.3
  • Chapter 5: 5.1, 5.8 and 5.9